Skip to content

Simplex reconfiguration framework - Part IV (MSM implementation - verification)#381

Open
yacovm wants to merge 2 commits into
mainfrom
reconfig-3c
Open

Simplex reconfiguration framework - Part IV (MSM implementation - verification)#381
yacovm wants to merge 2 commits into
mainfrom
reconfig-3c

Conversation

@yacovm
Copy link
Copy Markdown
Collaborator

@yacovm yacovm commented May 12, 2026

No description provided.

@yacovm yacovm force-pushed the reconfig-3c branch 2 times, most recently from bcb6d88 to d5dd2de Compare May 13, 2026 17:07
Comment thread msm/msm.go Outdated
Comment thread msm/msm.go Outdated
Comment thread msm/msm.go Outdated
Comment thread msm/msm.go Outdated
Comment thread msm/msm.go Outdated
Comment thread msm/verification.go Outdated
Comment on lines +136 to +138
if prev.NextPChainReferenceHeight == 0 {
return nil
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think we need to do some verification, otherwise a malicious node can make us finalize a block without us ever checking NextEpochApprovals.

See this test I asked claude to write.

// TestNextEpochApprovalsVerifierHaltViaGarbageApprovals demonstrates the liveness halt
// that the missing nil-check in verifyNormal enables.
//
// Scenario:
//
//	A: Normal block, prev.NextPChainReferenceHeight == 0.
//	B: Normal block proposed by a malicious leader. B legitimately advances
//	   NextPChainReferenceHeight (validator set changed), and also injects an
//	   all-1s bitmask into NextEpochApprovals. With prev (A).NextPChainReferenceHeight
//	   == 0 the verifier early-returns and B is accepted.
//	B+1: Honest leader. prev (B).NextPChainReferenceHeight > 0 so verifyNormal runs
//	     the full path. areNextEpochApprovalsSignersSupersetOfApprovalsOfPrevBlock
//	     requires B+1 to be a superset of B's all-1s bitmask. No honest proposer
//	     can produce that, so no valid B+1 exists -> chain halts at B.
//
// This test will start failing at step (B) once verifyNormal is fixed to reject
// garbage approvals when prev.NextPChainReferenceHeight == 0, which is the desired
// behavior.
func TestNextEpochApprovalsVerifierHaltViaGarbageApprovals(t *testing.T) {
	validators := NodeBLSMappings{
		{BLSKey: []byte{1}, Weight: 1},
		{BLSKey: []byte{2}, Weight: 1},
		{BLSKey: []byte{3}, Weight: 1},
	}
	getValidators := func(uint64) (NodeBLSMappings, error) { return validators, nil }

	v := &nextEpochApprovalsVerifier{
		sigVerifier:          &testSigVerifier{},
		getValidatorSet:      getValidators,
		keyAggregator:        &testKeyAggregator{},
		sigAggregatorCreator: newSignatureAggregatorCreator(),
	}

	// Malicious B: triggers transition AND injects all-1s garbage approvals.
	garbageApprovals := &NextEpochApprovals{NodeIDs: []byte{0xFF}, Signature: []byte("garbage")}
	mdA := SimplexEpochInfo{NextPChainReferenceHeight: 0, PChainReferenceHeight: 50}
	mdB := SimplexEpochInfo{
		NextPChainReferenceHeight: 100,
		PChainReferenceHeight:     50,
		NextEpochApprovals:        garbageApprovals,
	}

	// Step 1: verify B against A. Today the verifier accepts B. After the fix this
	// should fail with errUnexpectedApprovals and the halt below becomes unreachable.
	errB := v.Verify(verificationInput{
		nextBlockType:   BlockTypeNormal,
		prevMD:          StateMachineMetadata{SimplexEpochInfo: mdA},
		proposedBlockMD: StateMachineMetadata{SimplexEpochInfo: mdB},
	})
	require.NoError(t, errB, "malicious B accepted today; once fixed, expect errUnexpectedApprovals")

	// Step 2: honest B+1 tries to continue. It supplies its own approval (bit 0),
	// which is the realistic best-case for an honest proposer that just observed
	// the transition. The superset check against B's all-1s bitmask rejects it.
	honestApprovals := &NextEpochApprovals{NodeIDs: []byte{0x01}, Signature: []byte("sig")}
	mdBPlus1 := SimplexEpochInfo{
		NextPChainReferenceHeight: 100,
		PChainReferenceHeight:     50,
		NextEpochApprovals:        honestApprovals,
	}

	errBPlus1 := v.Verify(verificationInput{
		nextBlockType:   BlockTypeNormal,
		prevMD:          StateMachineMetadata{SimplexEpochInfo: mdB},
		proposedBlockMD: StateMachineMetadata{SimplexEpochInfo: mdBPlus1},
	})
	require.ErrorIs(t, errBPlus1, errSignerSetShrunk, "chain is wedged: no honest B+1 can satisfy the superset check against B's garbage")
}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so i think we need this check

	if prev.NextPChainReferenceHeight == 0 {
		if next.NextEpochApprovals != nil {
			return fmt.Errorf("%w but got %v", errUnexpectedApprovals, next.NextEpochApprovals)
		}
		return nil
	}

Comment thread msm/verification.go Outdated
prev, next := in.prevMD.SimplexEpochInfo, in.proposedBlockMD.SimplexEpochInfo

// An epoch number of 0 means this is not a Simplex block, so the next block should be the first Simplex block with epoch number 1.
if in.prevMD.SimplexEpochInfo.EpochNumber == 0 && in.proposedBlockMD.SimplexEpochInfo.EpochNumber != 1 {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how can we ever reach this code path via the msm? if we are in this verifier that means the EpochNumber must be > 0. Is this just a defensive check?

Comment thread msm/verification.go Outdated
Comment on lines +497 to +500
prevBlock, _, err := v.getBlock(in.prevBlockSeq, md.Prev)
if err != nil {
return fmt.Errorf("failed retrieving block: %w", err)
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this additional lookup is just to check is prevBlock.InnerBlock != nil. We can remove this lookup if we just pass in the prevBlock we already queried in VerifyBlock

prevMD := prevBlock.Metadata

Comment thread msm/msm.go Outdated
&epochNumberVerifier{},
&validationDescriptorVerifier{
getValidatorSet: func(pChainHeight uint64) (NodeBLSMappings, error) {
return sm.Config.GetValidatorSet(pChainHeight)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it make sense to just pass in the validator set for the pChainHeight in the verification input rather than passing in the method to both validationDescriptorVerifier and nextPChainReferenceHeightVerifier? I guess it would also save a lookup too

Comment thread msm/verification.go Outdated
if next.SealingBlockSeq != 0 {
return fmt.Errorf("%w: expected 0 but got %d", errSealingBlockSeqMismatch, next.SealingBlockSeq)
}
case BlockTypeTelock:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we add a defensive check to make sure the telock case falls under one of these ifs?

switch {
		case prev.SealingBlockSeq > 0:
			if next.SealingBlockSeq != prev.SealingBlockSeq {
				return fmt.Errorf("%w: expected %d but got %d", errSealingBlockSeqMismatch, prev.SealingBlockSeq, next.SealingBlockSeq)
			}
		case prev.BlockValidationDescriptor != nil:
			md, err := simplex.ProtocolMetadataFromBytes(in.prevMD.SimplexProtocolMetadata)
			if err != nil {
				return fmt.Errorf("failed parsing protocol metadata: %w", err)
			}
			if next.SealingBlockSeq != md.Seq {
				return fmt.Errorf("%w: expected %d but got %d", errSealingBlockSeqMismatch, md.Seq, next.SealingBlockSeq)
			}
		default:
			return errInvalidTelockPredecessor

Comment thread msm/msm.go Outdated
Copy link
Copy Markdown
Collaborator

@samliok samliok May 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it better to construct the verificationInput once before the loop?

@yacovm yacovm force-pushed the reconfig-3c branch 2 times, most recently from a492fe6 to 7d9705d Compare May 19, 2026 21:17
Comment thread msm/misc_test.go
return false
}

type testBlockStore map[uint64]StateMachineBlock
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i dont think we should be testing anything other structs in misc.go in this file. When we bridge over these avalanchego utilites defined in misc.go this file should just be deleted.

Unfortunately I think this means moving everything starting from line 181 and onwards.

Comment thread msm/msm.go
errNilInnerBlock = errors.New("InnerBlock is nil")
errBuiltGenesisInnerBlock = errors.New("received a genesis block")
errZeroBlockParentNoInnerBlock = errors.New("failed constructing zero block: parent block has no inner block")
errNilBlock = errors.New("block is nil")
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thoughts on errNilBlock and errNilInnerBlock being one error?

Comment thread msm/msm.go
errLastNonSimplexInnerBlockNil = errors.New("failed constructing zero block: last non-Simplex inner block is nil")
errInvalidProtocolMetadataSeq = errors.New("invalid ProtocolMetadata sequence number: should be > 0")
errUnknownState = errors.New("unknown state")
errNilInnerBlock = errors.New("InnerBlock is nil")
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simplex/msm/msm.go

Lines 247 to 250 in 7d9705d

func (sm *StateMachine) VerifyBlock(ctx context.Context, block *StateMachineBlock) error {
if block == nil {
return errNilInnerBlock
}

i think this is being used wrongly here. maybe we should just remove it and use errNilBlock instead

Comment thread msm/msm.go Outdated
errPChainReferenceHeightDecreased = errors.New("P-chain reference height is decreasing")
errValidatorSetUnchanged = errors.New("validator set unchanged; next P-chain reference height should not have advanced")
errPChainHeightNotReached = errors.New("haven't reached referenced P-chain height yet")
errUnknownBlockType = errors.New("unknown block type")
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should remove this error and just use errUnknownState

Comment thread msm/msm.go
errUnknownState = errors.New("unknown state")
errNilInnerBlock = errors.New("InnerBlock is nil")
errBuiltGenesisInnerBlock = errors.New("received a genesis block")
errZeroBlockParentNoInnerBlock = errors.New("failed constructing zero block: parent block has no inner block")
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this plus errParentInnerBlockHasNoInnerBlock can be merged into one

Comment thread msm/msm.go Outdated
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i dont think this function should be part of StateMachine

Comment thread msm/msm.go
return sm.wrapBlock(parentBlock, innerBlock, newSimplexEpochInfo, decisionToBuildBlock.pChainHeight, simplexMetadata, simplexBlacklist), nil
}

func (sm *StateMachine) verifyNormalBlock(ctx context.Context, parentBlock StateMachineBlock, nextBlock *StateMachineBlock, prevBlockSeq uint64) error {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thoughts on separating out some of the common verification code?

https://github.com/ava-labs/Simplex/compare/reconfig-3c...reconfig-seperations?expand=1

Comment thread msm/msm.go
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unrelated but doesn't this comment in SimplexEpochInfo need to be updated

// PrevSealingBlockHash is the hash of the sealing block of the previous epoch.
	// It is empty for the first epoch, and the second epoch has the PrevSealingBlockHash set to be
	// the hash of the first ever block built by the StateMachine.
	PrevSealingBlockHash [32]byte `canoto:"fixed bytes,3"`

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I already fixed it here.

Comment thread msm/msm.go Outdated
if err != nil {
sm.Logger.Error("Error retrieving previous sealing block", zap.Uint64("seq", simplexEpochInfo.EpochNumber), zap.Error(err))
return nil, fmt.Errorf("failed to retrieve previous sealing InnerBlock at epoch %d: %w", simplexEpochInfo.EpochNumber-1, err)
return SimplexEpochInfo{}, fmt.Errorf("failed to retrieve previous sealing InnerBlock at epoch %d: %w", simplexEpochInfo.EpochNumber-1, err)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we do EpochNumber - 1 here?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whoops, damn auto-complete

Comment thread msm/msm.go Outdated
pChainHeightBuff := make([]byte, 8)
binary.BigEndian.PutUint64(pChainHeightBuff, pChainHeight)

if err := sm.SignatureVerifier.VerifySignature(next.NextEpochApprovals.Signature, pChainHeightBuff, aggPK); err != nil {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we add a domain separator to this signature? similarly to what we do for quorum certificates?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah I was aware of it but kept postponing it because we only introduce signing at the next PR so didn't felt like it's needed at this point, but might as well do it now

yacovm added 2 commits May 21, 2026 19:02
Signed-off-by: Yacov Manevich <yacov.manevich@avalabs.org>
Signed-off-by: Yacov Manevich <yacov.manevich@avalabs.org>
Copy link
Copy Markdown
Collaborator

@samliok samliok left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was one thing confusing me but I couldn't put my finger on it. I think the question that I have been stumped on is "How do we find the first ever simplex block?"

I know we have the FirstEverSimplexBlock callback defined in StateMachine but I don't get how this function would be implemented in practice. We would need to initialize our engine to find the first ever simplex block which is either going to be seq 1 or seq lastSnowmanBlock + 1. If we started from genesis this is easy, but I'm unsure of how we plan on storing lastSnowmanBlock? Iterating through all the blocks to find this seems slow.

Also if I'm a non-validator looking to replicate epoch 1, I must validate the first ever simplex block before any other blocks in that epoch. But the question again is "how can i query for the first ever simplex block"? If another epoch has been created then I can just use PrevSealingBlockHash, but if no other epoch has been created what do I do?

// PrevSealingBlockHash is the hash of the sealing block of the previous epoch.
	// It is empty for the first epoch, and the second epoch has the PrevSealingBlockHash set to be
	// the hash of the first ever block built by the StateMachine.
	PrevSealingBlockHash [32]byte `canoto:"fixed bytes,3"`

Copy link
Copy Markdown
Collaborator

@samliok samliok left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

another question regarding the MSM + non validators. The MSM is to "hijack" BlockBuilding and Verification. For non-validators we do not care about BlockBuilding but we probably want to use the MSM for the same verification logic. Does it make sense to re-use the MSM struct for both non-validators and validators, even though non-validators do not care about BlockBuilding

@samliok
Copy link
Copy Markdown
Collaborator

samliok commented May 22, 2026

There was one thing confusing me but I couldn't put my finger on it. I think the question that I have been stumped on is "How do we find the first ever simplex block?"

I know we have the FirstEverSimplexBlock callback defined in StateMachine but I don't get how this function would be implemented in practice. We would need to initialize our engine to find the first ever simplex block which is either going to be seq 1 or seq lastSnowmanBlock + 1. If we started from genesis this is easy, but I'm unsure of how we plan on storing lastSnowmanBlock? Iterating through all the blocks to find this seems slow.

Also if I'm a non-validator looking to replicate epoch 1, I must validate the first ever simplex block before any other blocks in that epoch. But the question again is "how can i query for the first ever simplex block"? If another epoch has been created then I can just use PrevSealingBlockHash, but if no other epoch has been created what do I do?

// PrevSealingBlockHash is the hash of the sealing block of the previous epoch.
	// It is empty for the first epoch, and the second epoch has the PrevSealingBlockHash set to be
	// the hash of the first ever block built by the StateMachine.
	PrevSealingBlockHash [32]byte `canoto:"fixed bytes,3"`

It would be nice in some regards(and bad in others) if the first ever epoch was the sequence number of the first simplex block. In the genesis case the first epoch will be epoch 1. In the case of 100 snowman blocks -> simplex, the first epoch will be epoch 101 because the sequence of the first simplex block is 101.

This would be nice because if we are in the first epoch we can now know which sequence number to fetch to get to the start of the epoch. However, because the epoch number is not 1, we don't know for sure if we are in the first epoch. This is an easy problem because we can just check if that sequence is a sealing block, or check if PrevSealingBlockHash is empty & BlockValidatorDescriptor is present. In these cases we know we have hit the first ever simplex block.

Another nice benefit of this change is the branching in createSealingBlock can go away PLUS the weird callback that I was confused about in my earlier comment. This is because the logic in the if simplexEpochInfo.EpochNumber > 1 is applicable to both scenarios

	// If this is not the first epoch, and this is the sealing block, we set the hash of the previous sealing block.
	if simplexEpochInfo.EpochNumber > 1 {
		prevSealingBlock, finalization, err := sm.GetBlock(simplexEpochInfo.EpochNumber, [32]byte{})
		if err != nil {
			sm.Logger.Error("Error retrieving previous sealing block", zap.Uint64("seq", simplexEpochInfo.EpochNumber), zap.Error(err))
			return nil, fmt.Errorf("failed to retrieve previous sealing InnerBlock at epoch %d: %w", simplexEpochInfo.EpochNumber-1, err)
		}
		if finalization == nil {
			sm.Logger.Error("Previous sealing block is not finalized", zap.Uint64("seq", simplexEpochInfo.EpochNumber))
			return nil, fmt.Errorf("previous sealing InnerBlock at epoch %d is not finalized", simplexEpochInfo.EpochNumber-1)
		}
		simplexEpochInfo.PrevSealingBlockHash = prevSealingBlock.Digest()
	} else { // Else, this is the first epoch, so we use the hash of the first ever Simplex block.
		firstSimplexBlock := sm.FirstEverSimplexBlock()
		if firstSimplexBlock == nil {
			return nil, fmt.Errorf("first ever Simplex block is not set, but attempted to create a sealing block for the first epoch")
		}
		simplexEpochInfo.PrevSealingBlockHash = firstSimplexBlock.Digest()
	}

These are just my thoughts, not a concrete suggestion but they could be worth thinking about.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants